Abstract

The SOM algorithm is quite astonishing. On the one hand, it is very simple
to write down and to simulate, and its practical properties are clear and easy to
observe. On the other hand, its theoretical properties still remain without
proof in the general case, despite the great efforts of several authors. In this
paper, we review the latest results and provide some conjectures for
future work.

Keywords: Self-organization, Kohonen algorithm, Convergence of stochas- 
tic processes, Vectorial quantization. 

1 Introduction 

The now very popular SOM algorithm was originally devised by Teuvo Kohonen
in 1982 [35], [36]. It was presented as a model of the self-organization of neural
connections. What immediately raised the interest of the scientific community
(neurophysiologists, computer scientists, mathematicians, physicists) was the
ability of such a simple algorithm to produce organization, starting from possibly
total disorder. This is called the self-organization property.

As a matter of fact, the algorithm can be considered as a generalization of
Competitive Learning, which is a Vectorial Quantization algorithm [42] without any
notion of neighborhood between the units.



In the SOM algorithm, a neighborhood structure is defined for the units and is 
respected throughout the learning process, which imposes the conservation of the 
neighborhood relations. So the weights are progressively updated according to the 
presentation of the inputs, in such a way that neighboring inputs are little by little 
mapped onto the same unit or neighboring units. 

There are two phases. In practical applications as well as in theoretical
studies, one first observes self-organization (with a large neighborhood and a large
adaptation parameter), and later on convergence of the weights in order to quantize
the input space. In this second phase, the adaptation parameter is decreased to 0,
and the neighborhood is small or even reduced to one unit (the organization is
supposed not to be destroyed by the process in this phase, which is indeed true for
the 0-neighbor setting).

Even if the properties of the SOM algorithm can be easily reproduced by
simulations, and despite all the efforts, the Kohonen algorithm is surprisingly
resistant to a complete mathematical study. As far as we know, the only case where a
complete analysis has been achieved is the one-dimensional case (the input space has
dimension 1) for a linear network (the units are arranged along a one-dimensional
array).

A sketch of the proof was provided in Kohonen's original papers [35], [36]
in 1982 and in his books [37], [40] in 1984 and 1995. The first complete proof
of both self-organization and convergence properties was established (for a uniform
distribution of the inputs and a simple step-neighborhood function) by Cottrell and
Fort in 1987 [9].

Then, these results were generalized to a wide class of input distributions by
Bouton and Pages in 1993 and 1994 [6], [7], and to a more general neighborhood by
Erwin et al. (1992), who sketched the extension of the proof of self-organization
[21] and studied the role of the neighborhood function [20]. Recently, Sadeghi [59],
[60] has studied the self-organization for a general type of stimuli distribution and
neighborhood function.

Finally, Fort and Pages in 1993 [26], 1995 [27], and 1997 [3], [4] (with Benaim)
achieved the rigorous proof of the almost sure convergence towards a unique state,
after self-organization, for a very general class of neighborhood functions.

Before that, Ritter et al. in 1986 and 1988 [52], [53] shed some light on
the stationary state in any dimension, but they studied only the final phase after
the self-organization, and did not prove the existence of this stationary state.

In multidimensional settings, it is not possible to define what could be a well
ordered configuration set that would be stable for the algorithm and that could be
an absorbing class. For example, the grid configurations that Lo et al. proposed
in 1991 or 1993 [45], [46] are not stable, as proved in [10]. Fort and Pages in 1996
[25] show that there is no organized absorbing set, at least when the stimuli space
is continuous. On the other hand, Erwin et al. in 1992 [21] have proved that it
is impossible to associate a global decreasing potential function to the algorithm, as
long as the probability distribution of the inputs is continuous. Recently, Fort and
Pages in 1994 [26], in 1996 [27] and [28], and Flanagan in 1994 and 1996 [22], [23]
gave some results in higher dimension, but these remain incomplete.

In this paper, we try to present the state of the art. As a continuation of a previous
paper [13], we gather the more recent results that have been published in different
journals that may not be easily accessible to the neural network community.

We do not speak about the variants of the algorithm that have been defined and
studied by many authors in order to improve the performance or to facilitate the
mathematical analysis, see for example [5], [UJ, [58], [61]. Nor do we address
the numerous applications of the SOM algorithm. See for example Kohonen's
book [40] to get an idea of the profusion of these applications. We will only mention,
as a conclusion, some original data analysis methods based on the SOM algorithm.

The paper is organized as follows: in Section 2, we define the notations.
Section 3 is devoted to the one-dimensional case. Section 4 deals with the
multidimensional 0-neighbor case, that is the simple competitive learning, and sheds
some light on the quantization performances. In Section 5, some partial results about
the multidimensional setting are provided. Section 6 treats the discrete finite case,
and we present some data analysis methods derived from the SOM algorithm. The
conclusion gives some hints about future research.

2 Notations and definitions 

The network includes $n$ units located in an ordered lattice (generally a one- or
two-dimensional array). If $I = \{1, 2, \ldots, n\}$ is the set of the indices, the
neighborhood structure is provided by a neighborhood function $A$ defined on
$I \times I$. It is symmetric, non-increasing, and depends only on the distance
between $i$ and $j$ in the set of units $I$ (e.g. $|i - j|$ if $I = \{1, 2, \ldots, n\}$
is one-dimensional). $A(i,j)$ decreases with increasing distance between $i$ and $j$,
and $A(i,i)$ is usually equal to 1.

The input space $Q$ is a bounded convex subset of $\mathbb{R}^d$, endowed with the
Euclidean distance. The inputs $x(t)$, $t \ge 1$, are $Q$-valued, independent with
common distribution $\mu$.

The network state at time $t$ is given by

$$m(t) = (m_1(t), m_2(t), \ldots, m_n(t)),$$

where $m_i(t)$ is the $d$-dimensional weight vector of the unit $i$.

For a given state $m$ and input $x$, the winning unit $i_c(x,m)$ is the unit whose
weight $m_{i_c(x,m)}$ is the closest to the input $x$. Thus the network defines a map
$\Phi_m : x \mapsto i_c(x,m)$ from $Q$ to $I$, and the goal of the learning algorithm
is to converge to a network state such that the map $\Phi_m$ is "topology preserving"
in some sense.

For a given state $m$, let us denote by $C_i(m)$ the set of the inputs such that $i$
is the winning unit, that is $C_i(m) = \Phi_m^{-1}(i)$. The set of the classes $C_i(m)$
is the Euclidean Voronoi tessellation of the space $Q$ related to $m$.



The SOM algorithm is recursively defined by

$$\begin{cases}
i_c(x(t+1), m(t)) = \operatorname{argmin}\{\|x(t+1) - m_i(t)\|,\ i \in I\} \\
m_i(t+1) = m_i(t) - \varepsilon_t A(i_c, i)\,(m_i(t) - x(t+1)), \quad \forall i \in I
\end{cases} \qquad (1)$$



The essential parameters are 

• the dimension d of the input space 

• the topology of the network 

• the adaptation gain parameter $\varepsilon_t$, which is $]0,1[$-valued, constant or
decreasing with time,

• the neighborhood function $A$, which can be constant or time dependent,

• the probability distribution $\mu$.
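
The recursion (1) and the parameters listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function names, the step neighborhood on a one-dimensional array and its `radius` argument are assumptions made for the example.

```python
def winner(x, m):
    """Index i_c(x, m) of the unit whose weight vector is closest to x."""
    return min(
        range(len(m)),
        key=lambda i: sum((xk - wk) ** 2 for xk, wk in zip(x, m[i])),
    )


def step_neighborhood(c, i, radius=1):
    """Hypothetical step neighborhood on a linear array: 1 within `radius`, else 0."""
    return 1.0 if abs(c - i) <= radius else 0.0


def som_step(m, x, eps, neighborhood=step_neighborhood):
    """One iteration of (1): unit i moves toward x by eps * A(i_c, i)."""
    c = winner(x, m)
    return [
        [wk - eps * neighborhood(c, i) * (wk - xk) for wk, xk in zip(m[i], x)]
        for i in range(len(m))
    ]


# Three units on a line, scalar inputs (d = 1), one input presented.
state = [[0.0], [0.5], [1.0]]
state = som_step(state, [0.4], eps=0.5)
```

Passing a neighborhood equal to 1 only when the two indices coincide gives the 0-neighbor (competitive learning) update studied in Section 4.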

Available mathematical techniques

As mentioned before, when dealing with the SOM algorithm, one has to separate 
two kinds of results: those related to self-organization, and those related to conver- 
gence after organization. In any case, all the results have been obtained for a fixed 
time-invariant neighborhood function. 

First, the network state at time $t$ is a random vector $m(t)$ with values in
$(\mathbb{R}^d)^n$, and the sequence given by

$$m(t+1) = m(t) - \varepsilon_t H(x(t+1), m(t)) \qquad (2)$$

(where $H$ is defined in an obvious way according to the updating equation) is a
stochastic process. If $\varepsilon_t$ and $A$ are time-invariant, it is a homogeneous
Markov chain and can be studied with the usual tools if possible (and fruitful). For
example, if the algorithm converges in distribution, this limit distribution has to be
an invariant measure for the Markov chain. If the algorithm has some fixed point, this
point has to be an absorbing state of the chain. If it is possible to prove some strong
organization [28], it has to be associated to an absorbing class.

Another way to investigate self-organization and convergence is to study the
associated ODE (Ordinary Differential Equation) [41] that describes the mean behaviour
of the algorithm:

$$\frac{dm}{dt} = -h(m) \qquad (3)$$

where

$$h(m) = E(H(x, m)) = \int H(x, m)\, d\mu(x) \qquad (4)$$

is the expectation of $H(\cdot, m)$ with respect to the probability measure $\mu$.
Then it is clear that all the possible limit states $m^*$ are solutions of the functional
equation

$$h(m^*) = 0,$$

and any knowledge about the possible attracting equilibrium points of the ODE
can shed some light on the self-organizing property and the convergence. But
actually the complete asymptotic study of the ODE in the multidimensional setting
seems to be intractable. One has to verify some global assumptions on the function
$h$ (and on its gradient), and the explicit calculations are quite difficult, and perhaps
impossible.

In the convergence phase, the techniques depend on the kind of the desired
convergence mode. For the almost sure convergence, the parameter $\varepsilon_t$ needs
to decrease to 0, and the form of equation (2) suggests considering the SOM algorithm
as a Robbins-Monro [57] algorithm.

The usual hypothesis on the adaptation parameter to get almost sure results is
then:

$$\sum_t \varepsilon_t = +\infty \quad \text{and} \quad \sum_t \varepsilon_t^2 < +\infty. \qquad (5)$$

The less restrictive conditions $\sum_t \varepsilon_t = +\infty$ and
$\varepsilon_t \searrow 0$ generally do not ensure the almost sure convergence, but
some weaker convergence, for instance the convergence in probability.
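
To make the mean function $h$ concrete, here is a small sketch that evaluates it exactly in the simplest setting. It assumes $d = 1$, a uniform input distribution on $[0,1]$, distinct scalar weights lying in $[0,1]$, and $H_i(x,m) = A(i_c(x,m), i)(m_i - x)$ as read off from the updating equation (1). The function names are hypothetical.

```python
def voronoi_cells(m):
    """Cell (a, b) of each unit for distinct scalar weights m, restricted to [0, 1]."""
    order = sorted(range(len(m)), key=lambda i: m[i])
    cells = [None] * len(m)
    for rank, i in enumerate(order):
        # Cell borders are the midpoints between consecutive sorted weights.
        a = 0.0 if rank == 0 else 0.5 * (m[order[rank - 1]] + m[i])
        b = 1.0 if rank == len(m) - 1 else 0.5 * (m[i] + m[order[rank + 1]])
        cells[i] = (min(max(a, 0.0), 1.0), min(max(b, 0.0), 1.0))
    return cells


def mean_field(m, neighborhood):
    """h_i(m) = sum_j A(j, i) * integral over C_j(m) of (m_i - x) dx (uniform density)."""
    cells = voronoi_cells(m)
    h = []
    for i, mi in enumerate(m):
        total = 0.0
        for j, (a, b) in enumerate(cells):
            total += neighborhood(j, i) * (mi * (b - a) - 0.5 * (b * b - a * a))
        h.append(total)
    return h


def zero_neighbor(j, i):
    """0-neighbor case: only the winning unit is updated."""
    return 1.0 if i == j else 0.0


# h vanishes at the regular configuration, so it is an equilibrium of dm/dt = -h(m).
print(mean_field([0.125, 0.375, 0.625, 0.875], zero_neighbor))
```

With two units at 0.2 and 0.9, the first component of $h$ is negative and the second positive, so the flow $-h$ pushes both weights toward the centroids of their Voronoi cells.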

Let us first examine the results in dimension 1. 

3 The dimension 1 
3.1 The self-organization 

The input space is $[0,1]$, the dimension $d$ is 1 and the units are arranged on a
linear array. The neighborhood function $A$ is supposed to be non-increasing as a
function of the distance between units; the classical step neighborhood function
satisfies this condition. The input distribution $\mu$ is continuous on $[0,1]$: this
means that it does not weight any point. This is satisfied for example by any
distribution having a density.

Let us define

$$F_n^+ = \{m \in \mathbb{R}^n \,/\, 0 < m_1 < m_2 < \ldots < m_n < 1\}$$

and

$$F_n^- = \{m \in \mathbb{R}^n \,/\, 0 < m_n < m_{n-1} < \ldots < m_1 < 1\}.$$

In [9], [6], the following results are proved using Markovian methods:

Theorem 1 (i) The two sets $F_n^+$ and $F_n^-$ are absorbing sets.

(ii) If $\varepsilon$ is constant, and if $A$ is decreasing as a function of the distance
(e.g. if there are only two neighbors), the entering time $\tau$, that is the hitting time
of $F_n^+ \cup F_n^-$, is almost surely finite, and $\exists \lambda > 0$ s.t.
$\sup_{m \in [0,1]^n} E_m(\exp(\lambda \tau))$ is finite, where $E_m$ denotes the
expectation given $m(0) = m$.

Theorem 1 ensures that the algorithm will almost surely order the weights.
These results can be found for the more particular case ($\mu$ uniform and two
neighbors) in Cottrell and Fort [9], 1987, and the successive generalizations in Erwin
et al. [21], 1992, Bouton and Pages [6], 1993, Fort and Pages [27], 1995, Flanagan [23],
1996.

The techniques are the Markov chain tools.

Actually, following [6], it is possible to prove that whenever $\varepsilon_t \searrow 0$
and $\sum_t \varepsilon_t = +\infty$, then $\forall m \in [0,1]^n$,
$\mathrm{Proba}_m(\tau < +\infty) > 0$ (that is, the probability of self-organization
is positive regardless of the initial values, but not a priori equal to 1). In [60], Sadeghi
uses a generalized definition of the winner unit and shows that the probability of
self-organization is uniformly positive, without assuming a lower bound for $\varepsilon_t$.

No result of almost sure reordering with a vanishing $\varepsilon_t$ is known so far.
In [10], Cottrell and Fort propose a still unproved conjecture: it seems that the
re-organization occurs when the parameter $\varepsilon_t$ has a $1/\ln t$ order.
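
The absorbing property of the ordered sets and the entering time can be explored numerically with a sketch like the following. It assumes uniform inputs on $[0,1]$, a constant parameter and the two-neighbor step function; the helper names and the `max_steps` cap are choices made for this illustration only.

```python
import random


def is_ordered(m):
    """True if the scalar weights are strictly increasing or strictly decreasing."""
    increasing = all(a < b for a, b in zip(m, m[1:]))
    decreasing = all(a > b for a, b in zip(m, m[1:]))
    return increasing or decreasing


def som_step_1d(m, x, eps):
    """Two-neighbor step on a linear array: the winner and its adjacent units move toward x."""
    c = min(range(len(m)), key=lambda i: abs(x - m[i]))
    return [w + eps * (x - w) if abs(i - c) <= 1 else w for i, w in enumerate(m)]


def hitting_time(m, eps, rng, max_steps=100_000):
    """Number of steps before the weights become monotone, or None if the cap is hit."""
    for t in range(max_steps + 1):
        if is_ordered(m):
            return t
        m = som_step_1d(m, rng.random(), eps)
    return None


rng = random.Random(0)
start = [rng.random() for _ in range(8)]
print(hitting_time(start, eps=0.3, rng=rng))
```

Starting from an ordered state, `is_ordered` stays true along the whole trajectory, which is the absorbing property of Theorem 1 (i); from a random start, `hitting_time` gives one sample of the entering time.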

3.2 The convergence for dimension 1 

After having proved that the process enters an ordered state set (increasing or
decreasing) with probability 1, it is possible to study the convergence of the process.
So we assume that $m(0) \in F_n^+$. It would be the same if $m(0) \in F_n^-$.

3.2.1 Decreasing adaptation parameter 

In [9] (for the uniform distribution), in [7], [27] and more recently in [3], [4], 1997,
the almost sure convergence is proved in a very general setting. The results are
gathered in the theorem below:

Theorem 2 Assume that

1) $(\varepsilon_t) \in ]0,1[$ satisfies the condition (5),

2) the neighborhood function satisfies the condition $H_A$: there exists $k_0 <$ …
such that $A(k_0 + 1) < A(k_0)$,

3) the input distribution satisfies the condition $H_\mu$: it has a density $f$ such
that $f > 0$ on $]0,1[$ and $\ln(f)$ is strictly concave (or only concave, with
$\lim_{0^+} f + \lim_{1^-} f$ positive).

Then

(i) The mean function $h$ has a unique zero $m^*$ in $F_n^+$.

(ii) The dynamical system $\frac{dm}{dt} = -h(m)$ is cooperative on $F_n^+$, i.e. the
non-diagonal elements of $\nabla h(m)$ are non-positive.

(iii) $m^*$ is attracting.

So if $m(0) \in F_n^+$, $m(t) \to m^*$ almost surely.

In this part, the authors use the ODE method, a result by M. Hirsch on cooperative
dynamical systems jM], and the Kushner & Clark Theorem [5T], [3]. A. Sadeghi pointed
out that the non-positivity of the non-diagonal terms of $\nabla h$ is exactly the basic
definition of a cooperative dynamical system, and he obtained partial results in [59]
and more general ones in [60].

We can see that the assumptions are very general. Most of the usual probability
distributions (truncated on $[0,1]$) have a density $f$ such that $\ln(f)$ is strictly
concave. On the other hand, the uniform distribution is not strictly ln-concave, nor is
the truncated exponential distribution, but both satisfy the condition
$\lim_{0^+} f + \lim_{1^-} f$ positive.

Condition (5) is essential, because if only $\varepsilon_t \searrow 0$ and
$\sum_t \varepsilon_t = +\infty$ hold, there is a priori only convergence in probability.

In fact, by studying the associated ODE, Flanagan [22] shows that metastable
equilibria can appear before ordering.

In the uniform case, it is possible to calculate the limit $m^*$. Its coordinates are
solutions of an $(n \times n)$ linear system which can be found in [37] or [9]. An explicit
expression, up to the solution of a $3 \times 3$ linear system, is proposed in [5]. Some
further investigations are made in [31].

3.2.2 Constant adaptation parameter 

Another point of view is to study the convergence of $m(t)$ when
$\varepsilon_t = \varepsilon$ is a constant. Some results are available when the
neighborhood function corresponds to the two-neighbors setting. See [9], 1987 (for the
uniform distribution) and [7], 1994, for the more general case. Part of the results also
holds for a more general neighborhood function, see [3], [4].

Theorem 3 Assume that $m(0) \in F_n^+$.

Part A: Assume that the hypotheses $H_\mu$ and $H_A$ hold as in Theorem 2. Then
for each $\varepsilon \in ]0,1[$, there exists some invariant probability $\nu^\varepsilon$
on $F_n^+$.

Part B: Assume only that $A(i,j) = 1$ if and only if $|i - j| = 0$ or 1 (classical
2-neighbors setting).

(i) If the input distribution $\mu$ has an absolutely continuous part (e.g. has a density),
then for each $\varepsilon \in ]0,1[$, there exists a unique probability distribution
$\nu^\varepsilon$ such that the distribution of $m(t)$ weakly converges to
$\nu^\varepsilon$ when $t \to \infty$. The rate of convergence is geometric. Actually the
Markov chain is Doeblin recurrent.

(ii) Furthermore, if $\mu$ has a positive density, $\forall \varepsilon$, $\nu^\varepsilon$
is equivalent to the Lebesgue measure on $F_n^+$ if and only if $n$ is congruent with
0 or 1 modulo 3. If $n$ is congruent with 2 modulo 3, the Lebesgue measure is absolutely
continuous with respect to $\nu^\varepsilon$, but the converse is not true, that is
$\nu^\varepsilon$ has a singular part.

Part C: With the general hypotheses of Part A (which include those of Part B), if
$m^*$ is the unique globally attractive equilibrium of the ODE (see Theorem 2), then
$\nu^\varepsilon$ converges to the Dirac distribution on $m^*$ when
$\varepsilon \searrow 0$.

So when $\varepsilon$ is very small, the values will remain very close to $m^*$.

Moreover, from this result we may conjecture that for a suitable choice of
$\varepsilon_t$, presumably $\varepsilon_t = A/\ln t$, where $A$ is a constant, both
self-organization and convergence towards the unique $m^*$ can be achieved. This could
be proved by techniques very similar to the simulated annealing methods.



4 The 0-neighbor case in a multidimensional setting

In this case, we take any dimension $d$, the input space is $Q \subset \mathbb{R}^d$
and $A(i,j) = 1$ if $i = j$, and 0 elsewhere. There is no longer any topology on $I$,
and reordering no longer makes sense. In this case the algorithm is essentially a
stochastic version of the Linde, Gray and Buzo [H] algorithm (LBG). It belongs to the
family of the vectorial quantization algorithms and is equivalent to the Competitive
Learning. The mathematical results are more or less reachable. Even if this algorithm
is deeply different from the usual Kohonen algorithm, it is nevertheless interesting to
study it because it can be viewed as a limit situation when the neighborhood size
decreases to 0.

The first result (which is classical for Competitive learning), and can be found in 
El, EDI, ESI is: 



Theorem 4 (i) The 0-neighbor algorithm derives from the potential

$$V_n(m) = \frac{1}{2} \int \min_{1 \le i \le n} \|m_i - x\|^2 \, d\mu(x) \qquad (6)$$

(ii) If the probability distribution $\mu$ is continuous (for example $\mu$ has a
density $f$),

$$V_n(m) = \frac{1}{2} \sum_{i=1}^{n} \int_{C_i(m)} \|m_i - x\|^2 f(x)\, dx
= \frac{1}{2} \int \min_{1 \le i \le n} \|m_i - x\|^2 f(x)\, dx \qquad (7)$$

where $C_i(m)$ is the Voronoi set related to the unit $i$ for the current state $m$.

The potential function $V_n(m)$ is nothing else than the intra-class variance used
by statisticians to characterize the quality of a clustering. In the vectorial
quantization setting, $V_n(m)$ is called distortion. It is a measure of the loss of
information when replacing each input by the closest weight vector (or code vector).
The potential $V_n(m)$ has been extensively studied for 50 years, as can be seen in the
Special Issue of IEEE Transactions on Information Theory (1982) [42].

The expression (7) holds as soon as $m_i \neq m_j$ for all $i \neq j$ and the borders
of the Voronoi classes have probability 0 ($\mu(\cup_{i=1}^n \partial C_i(m)) = 0$).
This last condition is always verified when the distribution $\mu$ has a density $f$.
With these two conditions, $V_n(m)$ is differentiable at $m$ and its gradient vector reads

$$\nabla V_n(m) = \left( \int_{C_i(m)} (m_i - x)\, d\mu(x) \right)_{1 \le i \le n}.$$

So it becomes clear ([50], [40]) that the Kohonen algorithm with 0 neighbor is the
stochastic gradient descent relative to the function $V_n(m)$ and can be written:

$$m_i(t+1) = m_i(t) - \varepsilon_{t+1} 1_{C_i(m(t))}(x(t+1))\,(m_i(t) - x(t+1)),$$

where $1_{C_i(m(t))}(x(t+1))$ is equal to 1 if $x(t+1) \in C_i(m(t))$, and 0 if not.

The available results are more or less classical, and can be found in [44] and [8],
for a general dimension $d$ and a distribution $\mu$ satisfying the previous conditions.

Concerning the convergence results, we have the following when the dimension
$d = 1$, see Pages ([50], [51]), the Special Issue in IEEE [42] and also [43] for (ii):

The parameter $\varepsilon_t$ has to satisfy the conditions (5).

Theorem 5 Quantization in dimension 1

(i) If $\nabla V_n$ has finitely many zeros in $\overline{F_n^+}$, $m(t)$ converges almost
surely to one of these local minima.

(ii) If the hypothesis $H_\mu$ holds (see Theorem 2), $\nabla V_n$ has only one zero
point in $\overline{F_n^+}$, say $m_n^*$. This point belongs to $F_n^+$ and is a minimum.
Furthermore, if $m(0) \in F_n^+$,

$$m(t) \xrightarrow{\text{a.s.}} m_n^*.$$

(iii) If the stimuli are uniformly distributed on $[0,1]$, then

$$m_n^* = \left( \frac{2i - 1}{2n} \right)_{1 \le i \le n}.$$

Part (ii) shows that the global minimum of $V_n(m)$ is reachable in the one-dimensional
case, and part (iii) confirms that the algorithm provides an optimal discretization of
continuous distributions.
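
Part (iii) can be checked by a direct computation. The sketch below evaluates the distortion (7) exactly for the uniform distribution on $[0,1]$ and scalar weights in $[0,1]$; the function names are hypothetical. For the regular configuration every Voronoi cell has width $1/n$ and is centered on its weight, so the distortion equals $1/(24 n^2)$.

```python
def distortion_uniform(m):
    """V_n(m) = 1/2 * integral over [0, 1] of min_i (m_i - x)^2 dx, weights in [0, 1]."""
    w = sorted(m)
    # Voronoi borders: 0, the midpoints between consecutive weights, and 1.
    bounds = [0.0] + [0.5 * (a + b) for a, b in zip(w, w[1:])] + [1.0]
    total = 0.0
    for mi, a, b in zip(w, bounds, bounds[1:]):
        # Closed form of the integral of (x - mi)^2 over [a, b].
        total += ((b - mi) ** 3 - (a - mi) ** 3) / 3.0
    return 0.5 * total


def optimal_quantizer_uniform(n):
    """The configuration ((2i - 1) / 2n), i = 1..n, of Theorem 5 (iii)."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


for n in (2, 5, 50):
    v = distortion_uniform(optimal_quantizer_uniform(n))
    print(n, v, n * n * v)  # n^2 * V_n stays at 1/24
```

Moving any single weight away from this configuration increases the value returned by `distortion_uniform`, as expected at a minimum.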

A weaker result holds in the $d$-dimensional case, because one has only the
convergence to a local minimum of $V_n(m)$.

Theorem 6 Quantization in dimension $d$

If $\nabla V_n$ has finitely many zeros in $F_n^+$, and if these zeros have all their
components pairwise distinct, $m(t)$ converges almost surely to one of these local minima.

In the $d$-dimensional case, we are not able to compute the limit, even in the
uniform case. Following [48] and many experimental results, it seems that the
minimum distortion could be reached for a hexagonal tessellation, as mentioned in
ED] or [40].

In both cases, we can state the properties of the global minima of $V_n(m)$, in the
general $d$-dimensional setting. Let us note first that $V_n(m)$ is invariant under any
permutation of the integers $1, 2, \ldots, n$. So we can consider one of the global
minima, the ordered one (for example the lexicographically ordered one).



Theorem 7 Quantization property

(i) The function $V_n(m)$ is continuous on $(\mathbb{R}^d)^n$ and reaches its (global)
minima inside $Q^n$.

(ii) For a fixed $n$, a point $m_n^*$ at which the function $V_n$ is minimum has pairwise
distinct components.

(iii) Let $n$ be a variable and $m_n^* = (m_{1,n}^*, m_{2,n}^*, \ldots, m_{n,n}^*)$ the
ordered minimum of $V_n(m)$. The sequence $\min_{(\mathbb{R}^d)^n} V_n(m) = V_n(m_n^*)$
converges to 0 as $n$ goes to $+\infty$. More precisely, there exists a speed
$\beta = 2/d$ and a constant $A(f)$ such that

$$n^\beta V_n(m_n^*) \to A(f)$$

when $n$ goes to $+\infty$.

Following Zador [64], the constant $A(f)$ can be computed, $A(f) = a_d \|f\|_p$,
where $a_d$ does not depend on $f$, $p = d/(d+2)$ and
$\|f\|_p = [\int f^p(x)\, dx]^{1/p}$.

(iv) Then, the weighted empirical discrete probability measure

$$\mu_n = \sum_{i=1}^{n} \mu(C_i(m_n^*))\, \delta_{m_{i,n}^*}$$

converges in distribution to the probability measure $\mu$, when $n \to \infty$.

(v) If $F_n$ (resp. $F$) denotes the distribution function of $\mu_n$ (resp. $\mu$), one has

$$\min_{(\mathbb{R}^d)^n} V_n(m) = \min_{(\mathbb{R}^d)^n} \int_Q (F_n(x) - F(x))^2\, dx,$$

so when $n \to \infty$, $F_n$ converges to $F$ in quadratic norm.

The convergence in (iv) properly defines the quantization property, and explains
how to reconstruct the input distribution from the $n$ code vectors after convergence.
But in fact this convergence holds for any sequence
$y_n^* = (y_{1,n}, y_{2,n}, \ldots, y_{n,n})$ which "fills" the space when $n$ goes to
$+\infty$: for example it is sufficient that for any $n$, there exists an integer $n' > n$
such that in any interval $[y_{i,n}, y_{i+1,n}]$ (in $\mathbb{R}^d$), there are some
points of $y_{n'}^*$. But for any sequence of quantizers satisfying this condition, even
if there is convergence in distribution, even if the speed of the convergence can be
the same, the constant $A(f)$ will differ since it will not realize the minimum of the
distortion.

For each integer $n$, the solution $m_n^*$ which minimizes the quadratic distortion
$V_n(m)$ and the quadratic norm $\|F_n - F\|_2$ is said to be an optimal $n$-quantizer.
It also ensures that the discrete distribution function associated to the minimum
$m_n^*$, suitably weighted by the probability of the Voronoi classes, converges to the
initial distribution function $F$. So the 0-neighbor algorithm provides a skeleton of
the input distribution and, as the distortion tends to 0 as well as the quadratic norm
distance of $F_n$ and $F$, it provides an optimal quantizer. The weighting of the Dirac
functions by the volume of the Voronoi classes implies that the distribution $\mu_n$ is
usually quite different from the empirical one, in which each term would have the
same weight $1/n$.

This result has been used by Pages in [50] and [51] to numerically compute
integrals. He shows that the speed of convergence of the approximate integrals is exactly
$n^{2/d}$ for smooth enough functions, which is faster than the Monte Carlo method
(whose speed is $n^{1/2}$) while $d < 4$.
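
As a toy illustration of this use of quantizers (not Pages' actual method), the sketch below approximates an integral against the uniform distribution on $[0,1]$ by the weighted sum over the code vectors of the optimal quantizer of Theorem 5 (iii), each weighted by the probability $1/n$ of its Voronoi class. In this special case the formula reduces to the midpoint rule; the function names are hypothetical.

```python
def uniform_optimal_quantizer(n):
    """Code vectors (2i - 1) / 2n and Voronoi class probabilities 1/n (d = 1, uniform)."""
    points = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]
    weights = [1.0 / n] * n
    return points, weights


def quantization_integral(g, points, weights):
    """Approximate the integral of g against mu by sum_i mu(C_i) * g(m_i)."""
    return sum(w * g(p) for p, w in zip(points, weights))


for n in (10, 100):
    points, weights = uniform_optimal_quantizer(n)
    approx = quantization_integral(lambda x: x * x, points, weights)
    error = 1.0 / 3.0 - approx
    print(n, approx, error * n * n)  # error * n^2 stays at 1/12
```

For $g(x) = x^2$ the error is exactly $1/(12 n^2)$, in line with the $n^{2/d}$ speed for $d = 1$.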

The difficulty remains that the optimal quantizer m* is not easily reachable, since 
the stochastic process m(t) converges only to a local minimum of the distortion, 
when the dimension is greater than 1. 

Magnification factor

There is some confusion [37], [52] between the asymptotic distribution of an
optimal quantizer $m_n^*$ when $n \to \infty$ and that of the best random quantizer, as
defined by Zador [64] in 1982.

Zador's result, extended to the multi-dimensional case, is as follows: Let
$f$ be the input density of the measure $\mu$, and $(Y_1, Y_2, \ldots, Y_n)$ a random
quantizer, where the code vectors $Y_i$ are independent with common distribution of
density $g$.

Then, with some weak assumptions about $f$ and $g$, the distortion tends to 0 when
$n \to \infty$, with speed $\beta = 2/d$, and it is possible to define the quantity

$$A(f,g) = \lim_{n \to \infty} n^\beta E_g\left[ \sum_{i=1}^{n} \int_{C_i(Y)} \|Y_i - x\|^2 f(x)\, dx \right].$$

Then for any given input density $f$, the density $g$ (assuming some weak condition)
which minimizes $A(f,g)$ is

$$g^* \sim C f^{d/(d+2)}.$$

The inverse of the exponent $d/(d+2)$ is referred to as the Magnification Factor. Note
that in any case, when the data dimension is large, this exponent is near 1 (its value
is 1/3 when $d = 1$). Note also that this power has no effect when the density $f$ is
uniform. But in fact the optimal quantizer is another thing, with another definition.

Namely, the optimal quantizer $m_n^*$ (formed with the code vectors
$m_{1,n}^*, \ldots, m_{n,n}^*$) minimizes the distortion $V_n(m)$, and is obtained after
convergence of the 0-neighbor algorithm (if we could ensure the convergence to a global
minimum, which is true only in the one-dimensional case). So if we set

$$A_n(f, m_n^*) = n^\beta V_n(m_n^*) = \frac{n^\beta}{2} \sum_{i=1}^{n} \int_{C_i(m_n^*)} \|m_{i,n}^* - x\|^2 f(x)\, dx,$$

actually we have

$$A(f) = \lim_{n \to \infty} A_n(f, m_n^*) < A(f, g^*)$$

and the limit of the discrete distribution of $m_n^*$ is not equal to $g^*$. So there is no
magnification factor for the 0-neighbor algorithm, contrary to what is claimed in many
papers. It can be an approximation, but no more.



The problem comes from the confusion between two distinct notions: random
quantizer and optimal quantizer. And in fact, the good property is the convergence
of the weighted distribution function (7).

As to the SOM algorithm in the one-dimensional case, with a neighborhood function
not reduced to the 0-neighbor case, one can find in [55] or [19] some results
about a possible limit of the discrete distribution when the number of units goes to
$\infty$. But actually, the authors use Zador's result, which is not appropriate, as we
have just seen.

5 The multidimensional continuous setting 

In this section, we consider a general neighborhood function and the SOM algorithm 
is defined as in Section 2. 

5.1 Self-organization 

When the dimension $d$ is greater than 1, little is known about the classical Kohonen
algorithm. The main reason seems to be that it is difficult to define what an organized
state can be, and that no absorbing sets have been found. The configurations
whose coordinates are monotone are not stable, contrary to intuition. For
each configuration set which has been claimed to be left stable by the Kohonen
algorithm, it has been proved later that it was possible to leave it with a positive
probability. See for example [10]. Most people think that the Kohonen algorithm
in dimension greater than 1 could correspond to an irreducible Markov chain, that
is a chain for which there always exists a path with positive probability to go from
any state to any other. That property implies that there is no absorbing set at all.

Actually, as soon as $d \ge 2$, for a constant parameter $\varepsilon$, the 0-neighbor
algorithm is a Doeblin recurrent irreducible chain (see [7]), which cannot have any
absorbing class.

Recently, two apparently contradictory results were established, which can be
collected together as follows.

Theorem 8 ($d = 2$ and $\varepsilon$ is a constant) Let us consider an $n \times n$ units
square network and the set $F^{++}$ of states whose both coordinates are separately
increasing as functions of their indices, i.e.

$$F^{++} = \{\forall i_1 \le n,\ m^2_{i_1,1} < m^2_{i_1,2} < \ldots < m^2_{i_1,n};\ \forall i_2 \le n,\ m^1_{1,i_2} < m^1_{2,i_2} < \ldots < m^1_{n,i_2}\}$$

(i) If $\mu$ has a density on $Q$, and if the neighborhood function $A$ is everywhere
positive and decreases with the distance, the hitting time of $F^{++}$ is finite with
positive probability (i.e. $> 0$, but possibly less than 1). See Flanagan ([22], [23]).

(ii) In the 8-neighbor setting, the exit time from F ++ is finite with positive proba- 
bility. See Fort and Pages in (fEBj)- 
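
Membership in $F^{++}$ is easy to test on a simulated map. The sketch below assumes the reading in which, for each fixed second index, the first coordinate increases with the first index, and, for each fixed first index, the second coordinate increases with the second index. The nested-list layout `m[i1][i2] = (m1, m2)` and the function name are choices made for this illustration.

```python
def in_f_plus_plus(m):
    """Check whether an n x n grid state, stored as m[i1][i2] = (m1, m2), lies in F++."""
    n = len(m)
    # Second coordinate strictly increasing along the second index, for every i1.
    second_ok = all(
        m[i1][k][1] < m[i1][k + 1][1] for i1 in range(n) for k in range(n - 1)
    )
    # First coordinate strictly increasing along the first index, for every i2.
    first_ok = all(
        m[k][i2][0] < m[k + 1][i2][0] for i2 in range(n) for k in range(n - 1)
    )
    return first_ok and second_ok


n = 3
regular_grid = [[((i + 0.5) / n, (j + 0.5) / n) for j in range(n)] for i in range(n)]
print(in_f_plus_plus(regular_grid))  # the regular square grid is organized in this sense
```

Calling this function after each update of a simulated two-dimensional map gives sample values of the hitting and exit times discussed in the theorem.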



This means that (with a constant, even very small, parameter $\varepsilon$) the
organization is temporarily reached and that, even if we guess that it is almost stable,
disorganization may occur with positive probability.

More generally, the question is how to define an organized state. Many authors
have proposed definitions and measures of the self-organization, [65], [18], [62], [32],
[63], [33]. But none of these "organized" sets has a chance to be absorbing.

In [28], the authors propose to consider that a map is organized if and only if the
Voronoi classes of the closest neighboring units are in contact. They also precisely
define the nature of the organization (strong or weak).

They propose the following definitions : 

Definition 1 Strong organization 

There is strong organization if there exists a set of organized states S such that 

(i) S is an absorbing class of the Markov chain m(t), 

(ii) The entering time in S is almost surely finite, starting from any random weight 
vectors (see JMj). 

Definition 2 Weak organization 

There is weak organization if there exists a set of organized states $S$ such that all
the possible attracting equilibrium points of the ODE defined in (3) belong to the set
$S$.

The authors prove that there is no strong organization at least in two seminal
cases: the input space is $[0,1]^2$, the network is one-dimensional with two neighbors or
two-dimensional with eight neighbors. The existence of weak organization should be
investigated as well, but until now no exact result is available, even if the simulations
show a stable organized limit behavior of the SOM algorithm.

5.2 Convergence 

In [27] (see also [26]) the gradient of $h$ is computed in the $d$-dimensional setting
(when it exists). In [53], the convergence and the nature of the limit state are studied,
assuming that the organization has occurred, although there is no mathematical proof
of the convergence.

Another interesting result received a mathematical proof thanks to the computation
of the gradient of $h$: it is the dimension selection effect discovered by Ritter
and Schulten (see [53]). The mathematical result is (see [27]):

Theorem 9 Assume that $m_1^*$ is a stable equilibrium point of a general
$d_1$-dimensional Kohonen algorithm, with $n_1$ units, stimuli distribution $\mu_1$ and
some neighborhood function $A$. Let $\mu_2$ be a $d_2$-dimensional distribution with
mean $m_2^*$ and covariance matrix $\Sigma_2$. Consider the $(d_1 + d_2)$-dimensional
Kohonen algorithm with the same units and the same neighborhood function. The stimuli
distribution is now $\mu_1 \otimes \mu_2$.

Then there exists some $\eta > 0$ such that, if $\|\Sigma_2\| < \eta$, the state $m_1^*$
in the subspace $m_2 = m_2^*$ is still a stable equilibrium point for the
$(d_1 + d_2)$-dimensional algorithm.

It means that if the stimuli distribution is close to a $d_1$-dimensional distribution in
the $(d_1 + d_2)$-dimensional space, the algorithm can find a $d_1$-space stable
equilibrium point. That is the dimension selection effect.

From the computation of the gradient Vh, some partial results on the stability of 
grid equilibriums can also be proved: 

Let us consider $I = I_1 \times I_2 \times \dots \times I_d$, a $d$-dimensional array, with 
$I_l = \{1, 2, \dots, n_l\}$ for $1 \le l \le d$. Let us assume that the neighborhood function 
is a product function (for example 8 neighbors for $d = 2$) and that the input distributions 
in each coordinate are independent, that is $\mu = \mu_1 \otimes \dots \otimes \mu_d$. Finally, suppose 
that the support of each $\mu_l$ is $[0, 1]$. 

Let us call grid states the states $m^* = (m^*_{i_l, l},\ 1 \le i_l \le n_l,\ 1 \le l \le d)$ such that, 
for every $1 \le l \le d$, $(m^*_{i_l, l},\ 1 \le i_l \le n_l)$ is an equilibrium for the one-dimensional 
algorithm. Then the following results hold [27]: 

Theorem 10 (i) The grid states are equilibrium points of the ODE ([3j) in the 
$d$-dimensional case. 
(ii) For $d = 2$, if $\mu_1$ and $\mu_2$ have strictly positive densities $f_1$ and $f_2$ on $[0, 1]$, and if 
the neighborhood functions are strictly decreasing, the grid equilibrium points are not 
stable as soon as $n_1$ is large enough and the ratio $n_1/n_2$ is large (or small) enough (i.e. 
when $n_1 \to +\infty$ and $n_1/n_2 \to +\infty$ or $0$, see [21], Section 4.3). 
(iii) For $d = 2$, if $\mu_1$ and $\mu_2$ have strictly positive densities $f_1$ and $f_2$ on $[0, 1]$, and if 
the neighborhood functions are degenerated (0-neighbor case), $m^*$ is stable if $n_1$ and 
$n_2$ are less than or equal to 2, and is not stable in any other case (except perhaps when 
$n_1 = n_2 = 3$). 

Part (ii) gives a negative property for non-square grids, which can be related 
to the following fact: a product of one-dimensional quantizers is not the correct vectorial 
quantization. Notice also that we have no result about the simplest case: the 
square grid equilibrium in the uniformly distributed case. Anyone can observe by 
simulation that this square grid is stable (and probably the unique stable "organized" 
state). Nevertheless, even though its stability can be verified numerically using the 
gradient formula, it is not mathematically proved, even with two neighbors in each 
dimension! 

Moreover, if the distributions $\mu_1$ and $\mu_2$ are not uniform, the square grids are 
generally not stable, as can be seen experimentally. 

6 The discrete case 

In this case, there is a finite number $N$ of inputs and $\Omega = \{x_1, x_2, \dots, x_N\}$. The 
input distribution is uniform on $\Omega$, that is $\mu(dx) = \frac{1}{N} \sum_{l=1}^{N} \delta_{x_l}$. It is the setting of 
many practical applications, like Classification or Data Analysis. 



6.1 The results 



The main result ([21], [HE]) is that, for a general neighborhood function that does not 
depend on time, the algorithm locally derives from the potential 
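In a form consistent with the description that follows (an intra-class variance extended 
to the neighbor classes), such a potential can be written as below. This is a sketch under 
stated assumptions: $C_i$ is taken to denote the Voronoi class of unit $i$ (the inputs $x_l$ 
whose winning unit is $i$), and the normalization $1/(2N)$ is assumed. 

```latex
V_n(m) = \frac{1}{2N} \sum_{i \in I} \sum_{x_l \in C_i} \sum_{j \in I} \Lambda(i,j) \, \| m_j - x_l \|^2
```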
When $\Lambda(i,j) = 1$ if $i$ and $j$ are neighbors, and if $V(i)$ denotes the neighborhood 
of unit $i$ in $I$, $V_n(m)$ also reads 
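With a 0/1 neighborhood function, a potential of this kind reduces to a sum of squared 
distances between each weight and the inputs falling in its own class or in the classes of 
its neighbors. The expression below is a sketch under assumptions: $C_j$ denotes the 
Voronoi class of unit $j$, $V(i)$ is taken to contain $i$ itself, and the normalization 
$1/(2N)$ is assumed. 

```latex
V_n(m) = \frac{1}{2N} \sum_{i \in I} \; \sum_{x_l \in \bigcup_{j \in V(i)} C_j} \| m_i - x_l \|^2
```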
$V_n(m)$ is an intra-class variance extended to the neighbor classes, which is a 
generalization of the distortion defined in Section 4 for the 0-neighbor setting. But this 
potential has many singularities and its complete analysis has not been achieved, 
even if the discrete algorithm can be viewed as a stochastic gradient descent procedure. 
In fact, there is a problem with the borders of the Voronoi classes. The set of 
all these borders along the trajectories of the process $m(t)$ has measure 0, but it is 
difficult to assume that the given points $x_l$ never belong to this set. 

Actually the potential is the true measure of the self-organization. It measures 
both clustering quality and proximity between classes. Its study should shed 
some light on the Kohonen algorithm even in the continuous case. 
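As an illustration, such a potential can be evaluated numerically for a given set of 
weights. The sketch below assumes the extended-distortion form 
$\frac{1}{2N} \sum_l \sum_j \Lambda(i(x_l), j) \| m_j - x_l \|^2$, where $i(x_l)$ is the winning unit of $x_l$; 
the function names and the toy data are illustrative and not taken from the references. 

```python
def winner(x, weights):
    """Index of the unit whose weight vector is closest to x (Euclidean distance)."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))


def extended_distortion(data, weights, neighborhood):
    """Assumed form: (1 / 2N) * sum_l sum_j neighborhood(i(x_l), j) * ||m_j - x_l||^2."""
    total = 0.0
    for x in data:
        i = winner(x, weights)
        for j, w in enumerate(weights):
            sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
            total += neighborhood(i, j) * sq_dist
    return total / (2 * len(data))


# Toy example: 4 one-dimensional inputs and a string of 2 units.
data = [(0.0,), (0.2,), (0.8,), (1.0,)]
weights = [(0.1,), (0.9,)]

zero_neighbor = lambda i, j: 1.0 if i == j else 0.0        # 0-neighbor case
two_neighbors = lambda i, j: 1.0 if abs(i - j) <= 1 else 0.0

v0 = extended_distortion(data, weights, zero_neighbor)     # classical distortion (halved)
v2 = extended_distortion(data, weights, two_neighbors)     # also counts neighbor classes
```

In the 0-neighbor case the quantity reduces to half the usual intra-class variance, so the 
value with neighbors can never be smaller than the value without. 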

When the stimuli distribution is continuous, we know that the algorithm is not a 
gradient descent [21]. However, the algorithm can then be seen as an approximation 
of the stochastic gradient algorithm derived from the function $V_n(m)$. Namely, 
the gradient of $V_n(m)$ has a non-singular part, which corresponds to the Kohonen 
algorithm, and a singular one, which prevents the algorithm from being a gradient descent. 

This remark is the basis of many applications of the SOM algorithm, in combinatorial 
optimization as well as in data analysis, classification, and the analysis of the relations 
between qualitative classifying variables. 



6.2 The applications 

For example, in [23], Fort uses the SOM algorithm with a closed one-dimensional 
string, in a two-dimensional space where $M$ cities are located. He very quickly gets 
a very good sub-optimal solution. See also the paper []]. 
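The idea of a closed string of units adapting to cities in the plane can be sketched as 
follows. This is only an illustration, not the procedure of [23]: the number of units, the 
Gaussian neighborhood, the linear schedules and the name `ring_som_tour` are all 
assumptions made for the example. 

```python
import math
import random


def ring_som_tour(cities, n_steps=5000, seed=0):
    """Order the cities along a closed one-dimensional string of SOM units."""
    rng = random.Random(seed)
    n = 2 * len(cities)                        # number of units (assumed choice)
    cx = sum(c[0] for c in cities) / len(cities)
    cy = sum(c[1] for c in cities) / len(cities)
    # Start from a small circle around the centroid of the cities.
    units = [[cx + 0.1 * math.cos(2 * math.pi * k / n),
              cy + 0.1 * math.sin(2 * math.pi * k / n)] for k in range(n)]

    def winner(x):
        return min(range(n), key=lambda k: (units[k][0] - x[0]) ** 2
                                           + (units[k][1] - x[1]) ** 2)

    for t in range(n_steps):
        frac = t / n_steps
        eps = 0.8 * (1.0 - frac) + 0.01                 # decreasing adaptation parameter
        width = max(0.5, (n / 4.0) * (1.0 - frac))      # shrinking neighborhood width
        x = rng.choice(cities)
        win = winner(x)
        for k in range(n):
            d = min(abs(k - win), n - abs(k - win))     # distance along the closed string
            g = math.exp(-d * d / (2.0 * width * width))
            units[k][0] += eps * g * (x[0] - units[k][0])
            units[k][1] += eps * g * (x[1] - units[k][1])

    # Visit the cities in the order of their winning units (ties broken by index).
    return sorted(range(len(cities)), key=lambda i: (winner(cities[i]), i))


cities = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.8), (0.2, 0.9), (0.5, 0.5), (0.4, 0.1)]
tour = ring_som_tour(cities)
```

The returned list is a visiting order of the city indices. The sketch makes no claim about 
how close the resulting tour is to the optimum. 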



The applications in data analysis and classification are more classical. The principle 
is very simple: after convergence, the SOM algorithm provides a two- (or one-) 
dimensional organized classification which permits a low-dimensional representation 
of the data. See [40] for an impressive list of examples. 

In [15] and [TF], an application to forecasting is presented from a previous classi- 
fication by a SOM algorithm. 

6.3 Analysis of qualitative variables 

Let us define here two original algorithms to analyse the relations between qualitative 
variables. The first one is defined only for two qualitative variables. It is called 
KORRESP and is analogous to the simple classical Correspondence Analysis. The 
second one is devoted to the analysis of any finite number of qualitative variables. 
It is called KACM and is similar to the Multiple Correspondence Analysis. See [H] , 
[H], [16] for some applications. 

For both algorithms, we consider a sample of individuals and a number $K$ of 
questions. Each question $k$, $k = 1, 2, \dots, K$, has $m_k$ possible answers (or modalities). 
Each individual answers each question by choosing one and only one modality. If 
$M = \sum_{1 \le k \le K} m_k$ is the total number of modalities, each individual is represented by 
a row $M$-vector with values in $\{0, 1\}$. There is only one 1 between the 1st component 
and the $m_1$-th one, only one 1 between the $(m_1 + 1)$-th component and the $(m_1 + m_2)$-th 
one, and so on. 
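The coding of one individual can be sketched as follows. The function name and the 
convention that answers are numbered from 0 are assumptions made for the example. 

```python
def disjunctive_row(answers, n_modalities):
    """Build the 0/1 row of one individual.

    answers[k] (numbered from 0) is the modality chosen for question k,
    n_modalities[k] is the number m_k of modalities of question k.
    """
    row = []
    for answer, m_k in zip(answers, n_modalities):
        block = [0] * m_k
        block[answer] = 1          # exactly one 1 per question
        row.extend(block)
    return row


n_modalities = [3, 2, 4]           # m_1, m_2, m_3, so M = 9
row = disjunctive_row([1, 0, 3], n_modalities)
```

Here `row` has length $M = 9$ and contains exactly $K = 3$ ones, one in each block. 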

In the general case where $K > 2$, the data are summarized into a Burt Table, which 
is a cross-tabulation table. It is an $M \times M$ symmetric matrix composed of 
$K \times K$ blocks, such that the $(k, l)$-block $B_{kl}$ (for $k \ne l$) is the $(m_k \times m_l)$ contingency 
table which crosses question $k$ and question $l$. The block $B_{kk}$ is a diagonal 
matrix, whose diagonal entries are the numbers of individuals who have respectively 
chosen the modalities $1, 2, \dots, m_k$ for question $k$. In the following, the Burt Table 
is denoted by $B$. 
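A Burt Table can be accumulated directly from the answers, as in the following sketch. 
The function name, the numbering of answers from 0 and the toy sample are assumptions 
made for the example. 

```python
def burt_table(sample, n_modalities):
    """Cross-tabulate all pairs of modalities.

    sample[n][k] (numbered from 0) is the modality chosen by individual n
    for question k; n_modalities[k] is the number of modalities of question k.
    """
    offsets = [sum(n_modalities[:k]) for k in range(len(n_modalities))]
    M = sum(n_modalities)
    B = [[0] * M for _ in range(M)]
    for answers in sample:
        chosen = [off + a for off, a in zip(offsets, answers)]
        for r in chosen:
            for c in chosen:
                B[r][c] += 1       # diagonal entries count each modality
    return B


# 4 individuals, K = 3 questions with 2 modalities each (M = 6).
sample = [[0, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 1]]
B = burt_table(sample, [2, 2, 2])
```

Each diagonal block comes out diagonal because an individual chooses only one modality 
per question, and each row sums to $K$ times its diagonal entry. 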

In the case $K = 2$, we only need the contingency table $T$ which crosses the two 
variables. In that case, we set $p$ (resp. $q$) for $m_1$ (resp. $m_2$). 

The KORRESP algorithm 

In the contingency table $T$, the first qualitative variable has $p$ levels and corresponds 
to the rows. The second one has $q$ levels and corresponds to the columns. 
The entry $n_{ij}$ is the number of individuals categorized by row $i$ and column $j$. 
From the contingency table, the matrix of relative frequencies ($f_{ij} = n_{ij} / \sum_{ij} n_{ij}$) 
is computed. 

Then the rows and the columns are normalized in order to have a sum equal to 
1. The row profile $r(i)$, $1 \le i \le p$, is the discrete probability distribution of the 
second variable given that the first variable has modality $i$, and the column profile 
$c(j)$, $1 \le j \le q$, is the discrete probability distribution of the first variable given 
that the second variable has modality $j$. The classical Correspondence Analysis is a 
simultaneous weighted Principal Component Analysis on the row profiles and on the 
column profiles. The distance is chosen to be the $\chi^2$ distance. In the simultaneous 
representation, related modalities are projected into neighboring points. 
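The relative frequencies and the two families of profiles can be computed as in the 
sketch below. The function name and the toy table are illustrative, and the code assumes 
that no row or column of $T$ is empty. 

```python
def profiles(T):
    """Return relative frequencies f, row profiles r and column profiles c of T."""
    p, q = len(T), len(T[0])
    total = sum(sum(row) for row in T)
    f = [[T[i][j] / total for j in range(q)] for i in range(p)]
    row_marg = [sum(f[i]) for i in range(p)]
    col_marg = [sum(f[i][j] for i in range(p)) for j in range(q)]
    # r[i] is a distribution over the q columns, c[j] over the p rows.
    r = [[f[i][j] / row_marg[i] for j in range(q)] for i in range(p)]
    c = [[f[i][j] / col_marg[j] for i in range(p)] for j in range(q)]
    return f, r, c


T = [[10, 30],
     [20, 40]]
f, r, c = profiles(T)
```

Every profile sums to 1, since it is a conditional distribution. 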

To define the algorithm KORRESP, we build a new data matrix $D$: to each row 
profile $r(i)$, we associate the column profile $c(j(i))$ which maximizes the probability 
of $j$ given $i$, and conversely, we associate to each column profile $c(j)$ the row profile 
$r(i(j))$ which is the most probable given $j$. The data matrix $D$ is the $((p + q) \times (q + p))$-matrix 
whose first $p$ rows are the vectors $(r(i), c(j(i)))$ and whose last $q$ rows are the vectors 
$(r(i(j)), c(j))$. The SOM algorithm is processed on the rows of this data matrix $D$. 
Note that we use the $\chi^2$ distance to look for the winning unit and that we alternately 
pick the inputs at random among the first $p$ rows and the last $q$ ones. After 
convergence, each modality of both variables is classified into a Voronoi class. 
Related modalities are classified into the same class or into neighboring classes. This 
method gives a very quick and efficient way to analyse the relations between two 
qualitative variables. See [11] and [12] for real-world applications. 
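The construction of the KORRESP data matrix can be sketched as follows. The function 
name is hypothetical, ties in the most probable modality are broken by taking the first 
one (an assumption), and the code assumes that no row or column of the table is empty. 

```python
def korresp_matrix(T):
    """Build the (p + q) x (q + p) data matrix from a p x q contingency table T."""
    p, q = len(T), len(T[0])
    col_tot = [sum(T[i][j] for i in range(p)) for j in range(q)]
    r = [[T[i][j] / sum(T[i]) for j in range(q)] for i in range(p)]    # row profiles
    c = [[T[i][j] / col_tot[j] for i in range(p)] for j in range(q)]   # column profiles
    D = []
    for i in range(p):
        j_best = max(range(q), key=lambda j: r[i][j])   # most probable column given row i
        D.append(r[i] + c[j_best])
    for j in range(q):
        i_best = max(range(p), key=lambda i: c[j][i])   # most probable row given column j
        D.append(r[i_best] + c[j])
    return D


T = [[10, 30, 5],
     [20, 40, 5]]
D = korresp_matrix(T)
```

Each row of the result concatenates a row profile (length $q$) and a column profile 
(length $p$), so its entries sum to 2. 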

The KACM Algorithm 

When there are more than two qualitative variables, the above method does not 
work any more. In that case, the data matrix is just the Burt Table $B$. The rows are 
normalized in order to have a sum equal to 1. At each step, we pick a normalized 
row at random according to the frequency of the corresponding modality. We define 
the winning unit according to the $\chi^2$ distance and update the weight vectors as 
usual. After convergence, we get an organized classification of all the modalities, 
where related modalities belong to the same class or to neighboring classes. In that 
case also, the KACM method provides a very interesting alternative to classical 
Multiple Correspondence Analysis. 
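One step of such a procedure might look like the sketch below. Several points are 
assumptions rather than the authors' specification: the $\chi^2$ distance is taken as the 
Euclidean distance weighted by the inverse relative column masses of the Burt Table, the 
neighborhood is a 0/1 function on a string of units, and the name `kacm_step` is 
hypothetical. Every modality is assumed to be chosen at least once. 

```python
import random


def kacm_step(B, weights, eps, neighborhood, rng):
    """One update on the Burt Table B; returns the index of the winning unit."""
    M = len(B)
    grand_total = sum(sum(row) for row in B)
    rows = [[b / sum(row) for b in row] for row in B]           # rows normalized to sum 1
    col_mass = [sum(B[i][j] for i in range(M)) / grand_total for j in range(M)]
    freq = [B[i][i] for i in range(M)]                           # frequency of each modality
    x = rows[rng.choices(range(M), weights=freq)[0]]             # draw a row by frequency

    def chi2(w):                                                 # assumed chi-square distance
        return sum((xj - wj) ** 2 / g for xj, wj, g in zip(x, w, col_mass))

    win = min(range(len(weights)), key=lambda u: chi2(weights[u]))
    for u, w in enumerate(weights):
        h = neighborhood(win, u)
        for j in range(M):
            w[j] += eps * h * (x[j] - w[j])                      # usual SOM update
    return win


# Burt Table of 4 individuals answering K = 3 questions with 2 modalities each.
B = [[3, 0, 1, 2, 1, 2],
     [0, 1, 0, 1, 0, 1],
     [1, 0, 1, 0, 0, 1],
     [2, 1, 0, 3, 1, 2],
     [1, 0, 0, 1, 1, 0],
     [2, 1, 1, 2, 0, 3]]
rng = random.Random(0)
weights = [[1.0 / 6.0] * 6 for _ in range(3)]                    # string of 3 units
string_neighbors = lambda i, j: 1.0 if abs(i - j) <= 1 else 0.0
winners = [kacm_step(B, weights, 0.1, string_neighbors, rng) for _ in range(200)]
```

Since both the inputs and the initial weights sum to 1, every weight vector keeps a sum 
equal to 1 throughout, which gives a simple sanity check on the update. 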

The main advantages of both the KORRESP and KACM methods are their rapidity 
and their small computing time. While the classical methods have to use several 
representations with decreasing information in each, ours provide only one map, 
which is rough but unique and permits a rapid and complete interpretation. See [H] 
and [16] for the details and financial applications. 

7 Conclusion 

So far, the theoretical study in the one-dimensional case is nearly complete. It 
remains to find the appropriate decreasing rate to ensure the ordering. For the 
multidimensional setting, the problem is difficult. It seems that the Markov chain 
is irreducible and that further results could come from the careful study of the 
Ordinary Differential Equation (ODE) and from the powerful existing results about 
cooperative dynamical systems. 



On the other hand, the applications are more and more numerous, especially 
in data analysis, where the representation capability of the organized data is very 
valuable. The related methods make up a large and useful set of methods which 
can be substituted for the classical ones. To increase their use in the statistical 
community, it would be necessary to continue the theoretical study, in order to 
provide quality criteria and performance indices with the same rigour as for the 
classical methods. 